A joint use of pooling and imputation for genotyping SNPs

Background Despite continuing technological advances, the cost for large-scale genotyping of a high number of samples can be prohibitive. The purpose of this study is to design a cost-saving strategy for SNP genotyping. We suggest making use of pooling, a group testing technique, to drop the amount of SNP arrays needed. We believe that this will be of the greatest importance for non-model organisms with more limited resources in terms of cost-efficient large-scale chips and high-quality reference genomes, such as application in wildlife monitoring, plant and animal breeding, but it is in essence species-agnostic. The proposed approach consists in grouping and mixing individual DNA samples into pools before testing these pools on bead-chips, such that the number of pools is less than the number of individual samples. We present a statistical estimation algorithm, based on the pooling outcomes, for inferring marker-wise the most likely genotype of every sample in each pool. Finally, we input these estimated genotypes into existing imputation algorithms. We compare the imputation performance from pooled data with the Beagle algorithm, and a local likelihood-aware phasing algorithm closely modeled on MaCH that we implemented. Results We conduct simulations based on human data from the 1000 Genomes Project, to aid comparison with other imputation studies. Based on the simulated data, we find that pooling impacts the genotype frequencies of the directly identifiable markers, without imputation. We also demonstrate how a combinatorial estimation of the genotype probabilities from the pooling design can improve the prediction performance of imputation models. Our algorithm achieves 93% concordance in predicting unassayed markers from pooled data, thus it outperforms the Beagle imputation model which reaches 80% concordance. We observe that the pooling design gives higher concordance for the rare variants than traditional low-density to high-density imputation commonly used for cost-effective genotyping of large cohorts. Conclusions We present promising results for combining a pooling scheme for SNP genotyping with computational genotype imputation on human data. These results could find potential applications in any context where the genotyping costs form a limiting factor on the study size, such as in marker-assisted selection in plant breeding. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04974-7.

where x = (x 1 , x 2 , ..., x n ) and z = (z 1 , z 2 , ..., z n ). The genotype vector z is incomplete in the sense some genotypes might be missing after pooling, these are considered as latent variables. x is a complete realization of G whereas the missing entries in z are indicated with the −1 value.

Formulation of the estimation problem for the missing data
For any pattern ψ in a pooling block, the objective is to estimate the most likely genotype distribution underlying any missing item ((r, c) = (1, 1)) in z from the set of vectors x that are consistent with the pooling pattern observed. We denote this estimated distributionπ :=π i withπ i computed for any block item i aŝ = (P r(G = 0|p 0i ; ψ), P r(G = 1|p 1i ; ψ), P r(G = 2|p 2i ; ψ)) (4) whereπ i = (p 0i ,p 1i ,p 2i ) are prior estimates of the genotype distribution that z i is sampled from. The mechanism defined in equation 4 is an inversion sampling of the priors (p 0i ,p 1i ,p 2i ) with respect to ψ. The prior estimates can be initialized freely as long as they form a probability simplex.
Such a problem is usually solved with a Maximum Likelihood Estimation (MLE) approach or with methods based on Expectation-Maximization (EM). In the next sections, we propose likelihood estimation algorithms tailored for pooled genotype data that compute the iteration defined in equation 4.

Maximum Likelihood type II estimations (ML II) Marginal Likelihood Maximization
In the ML II method, the likelihood of every valid layout is marginalized over the block parameters ψ, r, c. Therefore, the genotype distribution problem can be solved as a series of independent MLE for each set of block parameters. An illustrative example is provided for ψ = ((2, 2, 0), (2, 2, 0)) and n B = 16 in Figure 2.
For any pooling pattern, the pooled data z is completed by enumerating all corresponding valid x. The genotypes frequencies are evaluated from their expected counts in each x and the aggregated likelihood of the genotypes is maximized. The entire enumeration process is a variation to the classical MLE [3] where the distribution parameters are estimated from the proportions of data observed. The problem can also be formulated as a single-iteration Bayesian inference [2]. Finally, we obtain for ψ = ((2, 2, 0), (2, 2, 0)), (r, c) = (1, 1) an estimateπ = (0.214, 0.393, 0.393). In this model, we do not account for any prior information about the genotypes distribution, such that all genotypes are a priori equally likely when computing their frequencies in x.

Maximum Marginal Likelihood with heterozygous degeneracy
By representing the genotypes with the random variable G introduced earlier, the two heterozygous states carrying alleles pairs (0, 1) and (1, 0) are merged into a single heterozygous genotype G = 1. This phenomenon can be defined as heterozygotes degeneracy. In other words, there are 2 equivalent micro layouts for each G = 1 enumerated, such that if n 1 is the number of heterozygotes in the pooling block, 2 n1 is the order of the degeneracy. To correct for this in a computationally efficient manner, the expected heterozygotes counts can be doubled before maximizing the likelihood of every genotype. With the previous example on Figure 2, the expected genotypes proportions become (0.154, 0.564, 0.282). In compensation, when computing the final genotype frequencies, we downscale by a factor of 2 the arbitrary Eventually, this gives the same results as the previous calculation (π = (0.214, 0.393, 0.393)).

Self-consistent Expectation-Maximization for estimating genotypes distributions
Let us assume we have prior beliefs about the genotypes distribution that we can include in the model presented for the ML II. For every sample in any observed pattern ψ, we seek to iteratively compute an estimateπ of π|ψ, r, c = (p 0 ,p 1 ,p 2 ) given an initial prior genotypes distributionπ (0) . We assumeπ (0) is not conditioned on ψ, which implies the hypothesis the likelihood of ψ is independent of the alternate allele frequency (AAF). This hypothesis is questionable as it is for example very unlikely to observe ψ = ((4, 0, 0), (4, 0, 0)) i.e. a pure G = 0 homozygous block, if the AAF at the pooled marker is close to 1. Therefore, we have also implemented other versions of our algorithm that introduce specific prior probabilities for the alleles with respect to ψ (unpublished work).
The algorithm repeats the E, M, and rescaling steps until an appropriate stopping criterion is reached.

Genotypes distribution in a population with uniform allelic dosage
The second approach assumes both alleles at the genotyped SNP are present in equal proportions in the population, hence the genotypes frequencies areπ (0) = (0.25, 0.5, 0.25). As previously, we detail the case where ψ = ((2, 2, 0), (2, 2, 0)). The E step at iteration m = 1 becomes P r(x;π (1) , ψ) = (4/12) * 4! 2!1!1! * p  (III) Simulating pooling consists in a first encoding step which resolves the row-and column-pool's genotypes: 2 rows have genotype 0, 2 have genotype 1, none has genotype 2, and idem for the columnpools (noted as a pooling pattern ψ = ((2, 2, 0), (2, 2, 0))).The second step decodes the pooled data into individual genotypes. (I) The distribution to estimate is initialized at a prior value that reflects a uniform alleles dosage at HWE. (II) Data completing with valid layouts is performed similarily as in MML. For each combination, the prior distribution is used to compute its likelihood. We assume that the observed data (pooling pattern) is a continous mix of all valid layouts, which mixing proportions are proportional to their likelihood.
(III) The most likely genotypes counts given the prior distribution are derived from the marginal likelihoods weighted by their mixing proportions.
(IV) Rescaling is applied for accounting for heterozygotes degeneracy and layouts collapsing. Upscaling precedes prior updating, whereas downscaling occurs before computing the final MAP estimate of genotypes counts.